1. INTRODUCTION

Cardiovascular diseases (CVDs) are a group of disorders of the heart and blood vessels and include two
major subtypes: coronary heart disease (CHD) and cerebrovascular disease (stroke). Despite advances
in clinical care and medicine, CVDs continue to be the principal cause of morbidity and mortality.
The primary cause of CHD is atherosclerosis, which reduces blood flow through the coronary arteries to
the heart muscle and thereby impairs heart function. CHD is now the leading cause of death worldwide:
an estimated 3.8 million men and 3.4 million women die from it each year [1]. In developed countries,
heart disease is the leading cause of death in both men and women [2, 3]. However, according to a 2005
report of the World Health Organization (WHO), the prevalence of and mortality from CHD are declining
in developed nations. Although rates of CVD have long since peaked in many developed countries and
mortality from the disease is declining, CVD still accounts for almost a third (32.8%) of all deaths
in the U.S., and a large majority of cardiac deaths in the U.S. are due to CHD [4].

In 2009, CHD accounted for 64% of all cardiac deaths in the U.S. [4]. Cognitive decline and CVDs
share many vascular risk factors (VRFs), such as smoking, hypertension, and diabetes mellitus; furthermore,
CVDs can contribute to cognitive decline by causing cerebral hypoperfusion, hypoxia, emboli, or infarcts [5].
Hypertension is also a leading modifiable risk factor for heart disease and stroke [6-8], and some of
the factors that lead to hypertension [9] may not be directly related to CHD but are secondarily
associated with it. It is therefore important to identify the associated risk factors in order to predict
the risk of CHD among individuals. Several studies [10-12] have examined the association between risk
factors and CHD. The focus of this study was to develop a binary logistic regression model to explore
the relationship between risk factors such as gender, age, number of cigarettes smoked, total cholesterol
level, and glucose level and the 10-year risk of CHD for residents of Framingham, Massachusetts, U.S.
Logistic regression was chosen because our outcome variable is dichotomous.

In recent years, the rate of CHD has been increasing rapidly, and researchers in the U.S. have been
working to identify the factors that predict future risk of CHD [13-18]. The associated research
questions are: Are men more susceptible to heart disease than women? Do increased age and number of
cigarettes smoked show increasing odds of having heart disease? Do increases in total cholesterol level
and glucose level increase the likelihood of a 10-year risk of CHD? The paper is organized as follows.
Section 2 describes the data collection procedure, the variables used in the analysis, and the
methodology and analysis procedure. Section 3 presents and discusses the results of the final fitted
logistic regression model. Section 4 concludes the paper.


2. RESEARCH METHOD 

The dataset is publicly available on the Kaggle website [19] and comes from an ongoing cardiovascular
study of residents of the town of Framingham, Massachusetts. It provides patient-level information and
consists of over 4,250 records and 15 attributes, which we used to answer the research questions and
to obtain the best-fitting model. Each attribute is a potential risk factor, and the attributes cover
demographic, behavioral, and medical risk factors. The variables are: Sex: whether the patient is male
or female; Age: age of the patient; CurrentSmoker: whether the patient is a current smoker (yes, no);
CigsPerDay: the average number of cigarettes the patient smoked per day; BPMeds: whether the patient
was on blood pressure medication (yes, no); prevalentStroke: whether the patient had previously had
a stroke (yes, no); prevalentHyp: whether the patient was hypertensive (yes, no); diabetes: whether
the patient had diabetes (yes, no); totChol: total cholesterol level of the patient; sysBP: systolic
blood pressure of the patient; diaBP: diastolic blood pressure of the patient; BMI: body mass index
of the patient; glucose (glcLevel): glucose level of the patient; heartRate: heart rate of the patient
(treated as continuous; in medical research, variables such as heart rate, although in fact discrete,
are considered continuous because of the large number of possible values). The outcome variable
(desired target) is the 10-year risk of coronary heart disease (binary: "1" means "Yes", "0" means "No").
The dataset had missing values for CigsPerDay (29), BPMeds (53), TotChol (50), BMI (19), heartRate (1),
and glcLevel (388). We did not apply any missing-value analysis technique in our data analysis.
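
The text does not state how records with missing values entered the model. One common default is
listwise deletion (complete-case analysis), in which any record with a missing covariate is dropped;
the n=3803 reported with Table 2 is consistent with this, although that is an inference rather than
something the paper states. The sketch below illustrates the idea on invented toy records; the helper
names `count_missing` and `complete_cases` are hypothetical.

```python
# Toy records for illustration only; these values are invented and are not
# rows from the Framingham dataset. None marks a missing value.
records = [
    {"CigsPerDay": 20, "BPMeds": 0, "TotChol": 230, "BMI": 26.1, "heartRate": 75, "glcLevel": 80},
    {"CigsPerDay": None, "BPMeds": 0, "TotChol": 210, "BMI": 24.3, "heartRate": 68, "glcLevel": 77},
    {"CigsPerDay": 0, "BPMeds": 1, "TotChol": 250, "BMI": 29.8, "heartRate": 82, "glcLevel": None},
    {"CigsPerDay": 5, "BPMeds": 0, "TotChol": 195, "BMI": 22.7, "heartRate": 70, "glcLevel": None},
    {"CigsPerDay": 0, "BPMeds": 0, "TotChol": 188, "BMI": 23.5, "heartRate": 64, "glcLevel": 85},
]


def count_missing(rows):
    """Return {variable: number of rows in which that variable is None}."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def complete_cases(rows):
    """Keep only rows with no missing value (listwise deletion)."""
    return [row for row in rows if all(value is not None for value in row.values())]


if __name__ == "__main__":
    print("missing per variable:", count_missing(records))
    print("complete cases:", len(complete_cases(records)), "of", len(records))
```

Listwise deletion is simple but discards every record that lacks even one value, which matters here
because glcLevel alone is missing in 388 records.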

Binary logistic regression is the customary choice because our outcome variable is dichotomous
(1 = the patient has a 10-year risk of CHD, 0 = the patient does not have a 10-year risk of CHD).
The Framingham dataset consists of binary (or nominal) and continuous independent variables [20-23].
To select the best model, we used purposeful selection of covariates to minimize the number of
variables in the model (parsimony), so that the resulting model is more likely to be numerically
stable and is more easily generalized. First, we ran a univariate analysis for each covariate to select
our preliminary main effects model. We selected all the variables, as they were all significant
at the 20% level of significance, as can be seen in Table 1.

We then ran the multivariable analysis with all the selected variables and found that CurrentSmoker
(p=0.6326), BPMeds (p=0.5330), PrevalentStroke (p=0.1415), PrevalentHyp (p=0.1162), Diabetes (p=0.9944),
DiaBP (p=0.5332), BMI (p=0.4085), and HeartRate (p=0.5833) were not statistically significant, as can
be seen in Table 2. We excluded BPMeds, PrevalentStroke, DiaBP, and HeartRate from our main effects
model. However, BMI, Diabetes, CurrentSmoker, and PrevalentHyp are clinically important for the
development of coronary heart disease (CHD) [24] and were therefore retained. Our preliminary main
effects model is:


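\begin{equation}
\begin{aligned}
g(x) = \ln\left(\frac{\pi(x)}{1-\pi(x)}\right) ={}& \beta_0 + \beta_1\,\mathrm{Sex} + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{BMI} + \beta_4\,\mathrm{CurrentSmoker} + \beta_5\,\mathrm{CigsPerDay} \\
& + \beta_6\,\mathrm{PrevalentHyp} + \beta_7\,\mathrm{Diabetes} + \beta_8\,\mathrm{SysBP} + \beta_9\,\mathrm{TotChol} + \beta_{10}\,\mathrm{glcLevel}
\end{aligned}
\tag{1}
\end{equation}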

Table 1. Univariable logistic regression analysis of each covariate

Parameter        Estimate  Standard Error  Wald Chi-Square  Pr>ChiSq  Odds Ratio  95% CI for Odds Ratio
Sex              0.49916   0.0859          32.92            <.0001    1.635       1.382  1.935
Age              0.0746    0.00527         201.016          <.0001    1.078       1.066  1.089
CurrentSmoker    0.1084    0.8560          1.6028           0.2055    1.114       0.942  1.368
CigsPerDay       0.0127    0.00340         13.96            0.0002    1.013       1.006  1.020
BPMeds           1.0635    0.1960          29.44            <.0001    2.896       1.973  4.253
PrevalentStroke  1.4921    0.4052          13.5601          0.0002    4.447       2.010  9.899
PrevalentHyp     0.9837    0.0872          127.26           <.0001    2.674       2.254  3.173
Diabetes         1.1219    0.2035          35.89            <.0001    3.385       2.272  5.045
TotChol          0.0048    0.00093         21.45            <.0001    1.005       1.003  1.007
SysBP            0.0241    0.00180         178.445          <.0001    1.024       1.021  1.028
DiaBP            0.0317    0.00342         86.06            <.0001    1.032       1.025  1.039
BMI              0.0486    0.0099          23.65            <.0001    1.050       1.029  1.071
HeartRate        0.0052    0.00351         2.229            0.1360    1.005       0.998  1.012
glcLevel         0.0105    0.00157         44.588           <.0001    1.011       1.007  1.014
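
The columns of Table 1 are linked by standard formulas: the Wald chi-square is (estimate / standard
error)^2, the odds ratio is exp(estimate), and a Wald-type 95% interval is exp(estimate ± 1.96 × standard
error). The following sketch applies these formulas to the Age row. It assumes the tabulated interval
is a Wald interval, which the paper does not state, and the function name `wald_summary` is
hypothetical. Small differences from the table are expected because the inputs are rounded.

```python
import math


def wald_summary(estimate, std_err, z_crit=1.96):
    """Return (Wald chi-square, odds ratio, lower limit, upper limit).

    Assumes a Wald-type interval on the log-odds scale; z_crit=1.96
    corresponds to a 95% confidence level.
    """
    wald_chi2 = (estimate / std_err) ** 2
    odds_ratio = math.exp(estimate)
    lower = math.exp(estimate - z_crit * std_err)
    upper = math.exp(estimate + z_crit * std_err)
    return wald_chi2, odds_ratio, lower, upper


if __name__ == "__main__":
    # Age row of Table 1: estimate 0.0746, standard error 0.00527.
    chi2, odds_ratio, lower, upper = wald_summary(0.0746, 0.00527)
    print(f"Wald chi-square = {chi2:.2f}")
    print(f"odds ratio = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```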


Table 2. Final model fit to the Framingham CHD study data, n=3803


Parameter          Coeff.       Std. Err.  Z        p
Intercept          -8.7215***   2.3584     13.675   0.0002
Sex                0.5276***    0.1066     24.5117  <0.0001
Age                0.1137***    0.0331     11.8026  0.0006
CurrentSmoker      0.0721       0.1534     0.2212   0.6381
CigsPerDay         0.0181**     0.00606    8.9462   0.0028
PrevalentHyp       0.2155       0.1330     2.6241   0.1052
Diabetes           -0.7789      0.6041     1.3756   0.2408
TotChol            0.00242      0.00973    0.0616   0.8040
SysBP              0.0139***    0.00284    23.8504  <.0001
BMI                -0.0322      0.0641     0.2520   0.6157
glcLevel           -0.0132      0.0110     1.4463   0.2291
Age*TotChol        -0.00019     0.000136   2.0569   0.1515
glcLevel*TotChol   0.000070*    0.000042   2.7631   0.0965
BMI*TotChol        0.000169     0.00260    0.4210   0.5165
Diabetes*glcLevel  0.00668      0.00482    1.9215   0.1657
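
The values in the column headed Z appear to be Wald chi-square statistics with one degree of freedom,
since they match (Coeff. / Std. Err.)^2 up to rounding. Under that reading, each p-value is the upper
tail of a chi-square distribution with 1 degree of freedom, which equals erfc(sqrt(x / 2)). The sketch
below checks two rows of Table 2; the function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), because X is the square of a
    standard normal variable.
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


if __name__ == "__main__":
    # (name, coefficient, standard error, statistic reported in Table 2)
    rows = [
        ("CigsPerDay", 0.0181, 0.00606, 8.9462),
        ("PrevalentHyp", 0.2155, 0.1330, 2.6241),
    ]
    for name, coef, std_err, reported in rows:
        recomputed = (coef / std_err) ** 2
        print(f"{name}: recomputed statistic {recomputed:.3f}, "
              f"reported {reported}, p = {wald_p_value(reported):.4f}")
```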


Because most of our independent variables (Age, BMI, CigsPerDay, SysBP, TotChol, and glcLevel) are
continuous, we checked the assumption of linearity in the logit for these variables. All of them were
linear in the logit except glcLevel. We tried different transformations of this variable, but none
satisfied the assumption, so we retained the original variable in the final model. We then added
clinically meaningful pairwise interactions to the main effects model one at a time and kept only the
important interactions, namely Age*TotChol, BMI*TotChol, Diabetes*glcLevel, and glcLevel*TotChol, based
on the Wald statistic, as can be seen in Table 3. Our final preliminary model is therefore:


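\begin{equation}
\begin{aligned}
g(x) ={}& \beta_0 + \beta_1\,\mathrm{Sex} + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{BMI} + \beta_4\,\mathrm{CurrentSmoker} + \beta_5\,\mathrm{CigsPerDay} \\
& + \beta_6\,\mathrm{PrevalentHyp} + \beta_7\,\mathrm{Diabetes} + \beta_8\,\mathrm{SysBP} + \beta_9\,\mathrm{TotChol} + \beta_{10}\,\mathrm{glcLevel} \\
& + \beta_{11}\,\mathrm{Age}\times\mathrm{TotChol} + \beta_{12}\,\mathrm{glcLevel}\times\mathrm{TotChol} + \beta_{13}\,\mathrm{BMI}\times\mathrm{TotChol} + \beta_{14}\,\mathrm{Diabetes}\times\mathrm{glcLevel}
\end{aligned}
\tag{2}
\end{equation}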
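
To make model (2) concrete, the sketch below evaluates g(x) and the predicted probability
1 / (1 + exp(-g(x))) using the coefficients of Table 2. Several things are assumptions made only for
illustration: the Sex and Age coefficients are read as 0.5276 and 0.1137 (the Age value is quoted in
the text, and the Sex value is consistent with the odds ratio 1.695 in Table 4), Sex is assumed to be
coded 1 for male, the binary variables are coded 0/1, and the patient values are invented. The names
`COEF`, `linear_predictor`, and `predicted_risk` are hypothetical.

```python
import math

# Coefficients as read from Table 2. Interaction terms are keyed "A*B".
COEF = {
    "Intercept": -8.7215,
    "Sex": 0.5276,            # assumed coding: 1 = male, 0 = female
    "Age": 0.1137,
    "CurrentSmoker": 0.0721,
    "CigsPerDay": 0.0181,
    "PrevalentHyp": 0.2155,
    "Diabetes": -0.7789,
    "TotChol": 0.00242,
    "SysBP": 0.0139,
    "BMI": -0.0322,
    "glcLevel": -0.0132,
    "Age*TotChol": -0.00019,
    "glcLevel*TotChol": 0.000070,
    "BMI*TotChol": 0.000169,
    "Diabetes*glcLevel": 0.00668,
}


def linear_predictor(x):
    """Evaluate g(x): intercept plus each coefficient times its (product) term."""
    g = COEF["Intercept"]
    for name, beta in COEF.items():
        if name == "Intercept":
            continue
        value = 1.0
        for factor in name.split("*"):
            value *= x[factor]
        g += beta * value
    return g


def predicted_risk(x):
    """Inverse logit of g(x): the modelled probability of 10-year CHD risk."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x)))


# Hypothetical patient, invented for illustration only.
patient = {
    "Sex": 1, "Age": 50, "CurrentSmoker": 1, "CigsPerDay": 20,
    "PrevalentHyp": 0, "Diabetes": 0, "TotChol": 230, "SysBP": 130,
    "BMI": 26, "glcLevel": 80,
}

if __name__ == "__main__":
    print(f"g(x) = {linear_predictor(patient):.4f}")
    print(f"predicted probability = {predicted_risk(patient):.4f}")
```

Because the published coefficients are rounded, a probability computed this way is only approximate
and should not be read as the output of the authors' fitted model.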


To assess the fit of the model, we used the Hosmer-Lemeshow goodness-of-fit statistic; the
corresponding p-value, computed from the chi-square distribution with 8 degrees of freedom, is 0.7591.
We therefore do not reject the null hypothesis, which indicates that the model fits quite well.
The deviance and Pearson goodness-of-fit statistics also suggest that the model fits well, with p-values
of 1.000 and 0.3396, respectively. The percent concordant is high at 74.2, which suggests that the model
is good. We also checked the Somers' D and Gamma statistics, both of which have a moderate value of 0.481.
The area under the ROC curve (AUC) of our selected model is 0.7416. This suggests that the selected
model discriminates acceptably and, compared with the AUC of the other models considered, discriminates
better. We therefore conclude that our final model fits well. To find subjects that are poorly fit or
overly influential, we examined the influence diagnostic and predicted probability diagnostic plots
shown in Figure 1 and found that subjects ID 963 and ID 3488 are influential. The plot of standardized
Pearson residuals against predicted values, shown in Figure 2, also suggests that the residuals are
quite independent.
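
The concordance measures quoted above are all built from pairs made of one subject with the outcome
and one without. A pair is concordant when the subject with the outcome has the higher predicted
probability. The sketch below computes percent concordant, the c statistic (equal to the AUC), and
Somers' D on invented toy data; the function name `concordance` is hypothetical, and exact-equality
tie handling is an assumption that may differ from what the authors' software does.

```python
def concordance(probabilities, outcomes):
    """Pairwise concordance summary for a binary outcome.

    Returns percent concordant, the c statistic (AUC), and Somers' D,
    comparing every event (outcome 1) with every non-event (outcome 0).
    """
    events = [p for p, y in zip(probabilities, outcomes) if y == 1]
    nonevents = [p for p, y in zip(probabilities, outcomes) if y == 0]
    concordant = discordant = tied = 0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                concordant += 1
            elif p_event < p_nonevent:
                discordant += 1
            else:
                tied += 1
    pairs = len(events) * len(nonevents)
    return {
        "percent_concordant": 100.0 * concordant / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
        "somers_d": (concordant - discordant) / pairs,
    }


if __name__ == "__main__":
    # Toy predicted probabilities and outcomes, invented for illustration.
    toy_probabilities = [0.9, 0.6, 0.4, 0.4, 0.2]
    toy_outcomes = [1, 1, 0, 1, 0]
    print(concordance(toy_probabilities, toy_outcomes))
```

With these definitions Somers' D equals 2c - 1. Applied to the reported AUC of 0.7416 this gives
about 0.483, close to the reported 0.481; the small gap may come from rounding or tie handling.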

Table 3. Interactions included one by one into the main effect model


Parameter              Coeff.     Std. Err.  Wald Chi-Sqr  P>Chi-Sqr
Age*BMI                -0.0133    0.0122     1.1879        0.2756
Age*CurrentSmoker      -0.00066   0.00137    0.2336        0.6289
Age*CigsPerDay         0.000269   0.000479   0.3158        0.5741
Age*PrevalentHyp       -0.00142   0.0122     0.0135        0.9076
Age*Diabetes           -0.0284    0.0319     0.7909        0.3738
Age*TotChol            -0.00018   0.000134   1.8005        0.1796
Age*SysBP              -0.00015   0.000259   0.3247        0.5582
Age*glcLevel           0.000020   0.000231   0.0078        0.9298
Sex*TotChol            0.00261    0.00215    1.4689        0.2255
Sex*SysBP              0.00593    0.00412    2.0692        0.1503
PrevalentHyp*glcLevel  0.00496    0.00342    2.1034        0.1470
BMI*glcLevel           0.000159   0.000296   0.2893        0.5907
TotChol*glcLevel       0.000067   0.000039   2.889         0.0892
BMI*TotChol            0.000277   0.000251   0.7499        0.3865
Diabetes*glcLevel      0.00612    0.00467    1.7157        0.1902


Figure 1. Influence diagnostic and predicted probability [panels: DfBeta for Intercept; One Step Difference in Deviance]

Figure 2. Standardized Pearson residuals


3. RESULTS AND DISCUSSION 

Each of our fourteen covariates was statistically significant in the univariate analysis at the 20%
level of significance. PrevalentStroke had the highest odds ratio among the covariates in the
univariable logistic regression (the odds of a 10-year risk of CHD for patients with a previous history
of stroke were 4.447 times the odds for those without), as can be seen in Table 1. The final logistic
regression model excludes BPMeds, PrevalentStroke, DiaBP, and HeartRate and adds four clinically
meaningful interactions, of which only one (totChol*glcLevel) is marginally statistically significant.
In this model, CurrentSmoker (z=0.2212, p=0.6381), PrevalentHyp (z=2.6241, p=0.1052), BMI, Diabetes,
totChol, glcLevel, Diabetes*glcLevel, BMI*totChol, and age*totChol were not statistically significant,
but we kept them in the model because of their clinical significance. Age is statistically significant
(z=11.8026, p=0.0006), and its effect is modified by the patient's total cholesterol level. When the
modifying term is zero, a 5-year increase in age increases the odds of a 10-year risk of CHD by 76%
[exp(5*0.1137)=1.76]. Sex is highly statistically significant (z=24.5117, p<0.0001), and no other
covariate modifies its effect. From Table 4, the odds ratio for Sex (OR=1.695, 95% CI=(1.375, 2.089))
indicates that the odds of a 10-year risk of CHD are 69.5% higher among males than among females,
controlling for all other independent variables. In other words, males are more susceptible to
a 10-year risk of CHD than females.
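
Because Age enters the model both as a main effect and through Age*TotChol, the odds ratio for an
increase in age depends on the cholesterol level: for an increase of c years it is
exp(c × (β_Age + β_Age*TotChol × TotChol)). The sketch below uses the Table 2 coefficients (0.1137 and
-0.00019). The cholesterol value of 200 is a hypothetical illustration, and the function name
`age_odds_ratio` is an assumption of this example.

```python
import math

BETA_AGE = 0.1137            # main effect of Age (Table 2)
BETA_AGE_TOTCHOL = -0.00019  # Age*TotChol interaction (Table 2)


def age_odds_ratio(years, tot_chol):
    """Odds ratio for an increase of `years` in age at a fixed TotChol value."""
    return math.exp(years * (BETA_AGE + BETA_AGE_TOTCHOL * tot_chol))


if __name__ == "__main__":
    # With the interaction term at zero this reproduces exp(5*0.1137).
    print(f"TotChol = 0:   OR = {age_odds_ratio(5, 0):.3f}")
    # Hypothetical cholesterol level, chosen only to show the modification.
    print(f"TotChol = 200: OR = {age_odds_ratio(5, 200):.3f}")
```

Since the interaction coefficient is negative, the estimated age effect shrinks as total cholesterol
rises, although the interaction itself is not statistically significant in Table 2.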

CigsPerDay is statistically significant (z=8.9462, p=0.0028), and no other covariate modifies its
effect on the outcome. For a patient who smokes, say, 5 more cigarettes per day than another patient,
the odds of a 10-year risk of CHD are 9.4% higher [exp(5*0.0181)=1.094]. Higher cigarette consumption
per day therefore increases the likelihood of a 10-year risk of CHD in our study population. SysBP is
statistically significant (z=23.8504, p<0.0001), as can be seen in Table 2, and no other covariate
modifies its effect on the outcome variable. For a patient whose SysBP is, say, 5 units higher than
that of another patient, the odds of a 10-year risk of CHD are 7.2% higher [exp(5*0.0139)=1.072].
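
For a covariate with no interaction, the odds ratio for a change of c units is exp(c × β), and an
approximate Wald interval is exp(c × (β ± 1.96 × SE)). The sketch below applies this to CigsPerDay and
SysBP with the coefficients and standard errors of Table 2. The Wald interval is an assumption of this
example; Table 4 reports profile-likelihood limits, which can differ slightly. The function name
`unit_change_odds_ratio` is hypothetical.

```python
import math


def unit_change_odds_ratio(beta, std_err, units, z_crit=1.96):
    """Odds ratio and approximate Wald 95% limits for a change of `units`."""
    point = math.exp(units * beta)
    lower = math.exp(units * (beta - z_crit * std_err))
    upper = math.exp(units * (beta + z_crit * std_err))
    return point, lower, upper


if __name__ == "__main__":
    # (name, coefficient, standard error) taken from Table 2.
    for name, beta, std_err in [("CigsPerDay", 0.0181, 0.00606),
                                ("SysBP", 0.0139, 0.00284)]:
        point, lower, upper = unit_change_odds_ratio(beta, std_err, units=5)
        print(f"{name}: OR for +5 units = {point:.3f} "
              f"(approx. 95% CI {lower:.3f} to {upper:.3f})")
```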


Table 4. Odds ratio estimates and profile-likelihood confidence intervals 


Effect          Unit  Estimate  95% Confidence limits
Sex             1.0   1.695     1.375  2.089
currentSmoker   1.0   1.075     0.796  1.452
cigsPerDay      1.0   1.018     1.006  1.030
prevalentHyp    1.0   1.240     0.956  1.610
sysBP           1.0   1.014     1.008  1.020


TotChol and BMI are clinically significant factors for cardiovascular disease; however, they are not
statistically significant in our analysis. Similarly, Diabetes and glcLevel are potential risk factors
for developing cardiovascular disease, but they are not statistically significant. The only interaction
that is clinically meaningful and marginally statistically significant for our dataset is
totChol*glcLevel (z=2.7631, p=0.09). This means that the effect of totChol on the 10-year risk of CHD
varies with glucose level when BMI and age are held at their average values, as can be seen in Figure 3.
An increase in total cholesterol level together with glucose level thus increases the likelihood of
a 10-year risk of CHD in our study population. The effect of Diabetes was also modified by glucose
level: for subjects who have diabetes mellitus and a glucose concentration above 100, the odds
of a 10-year risk of CHD are much higher, as can be seen in Figure 4.

Figure 3. Odds ratio of interaction effect of totChol*glcLevel in the final model

Figure 4. Odds ratio of interaction effect of Diabetes*glcLevel in the final model

Our final model is both statistically and clinically significant, and the goodness-of-fit results also
support its ability to perform well: the area under the ROC curve is moderately high, so the model can
discriminate between subjects who experience the outcome of interest and those who do not.
In diagnostic checking using the standardized Pearson residuals and other measures such as DFBETAS,
DIFDEV, and DIFCHISQ, the model exhibits a good fit, as no influential observations or outliers were
present in our dataset. All the predictors in the model are clinically meaningful, and all covariates
except CurrentSmoker, PrevalentHyp, BMI, Diabetes, totChol, glcLevel, Diabetes*glcLevel, BMI*totChol,
and age*totChol are statistically significant in our final model. We found that Sex, CigsPerDay, SysBP,
and Age, which are risk factors for cardiovascular disease, are highly influential in predicting CHD,
in agreement with the literature [25-27]. We also found that increasing age increased the likelihood
of CHD, which supports recent findings [28]. In addition, the effect of totChol is modified by the
patient's glucose concentration level, which also agrees with the literature [29]. Moreover, the effect
of Diabetes is modified by glucose concentration level [30]. We found that the odds ratio for CHD is
highest, at 1.249 (CI: 0.628, 2.485), for patients with diabetes when the glucose concentration level
is as high as 150 units, which also supports previous findings [31].
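
The odds ratio of 1.249 at a glucose level of 150 can be approximately recovered from the Table 2
coefficients. With a Diabetes*glcLevel interaction, the odds ratio for diabetes versus no diabetes at
a given glucose level is exp(β_Diabetes + β_Diabetes*glcLevel × glcLevel). The sketch below evaluates this
expression; it uses the rounded published coefficients, so agreement with the reported value is only
approximate, and the function name `diabetes_odds_ratio` is hypothetical.

```python
import math

BETA_DIABETES = -0.7789      # main effect of Diabetes (Table 2)
BETA_DIABETES_GLC = 0.00668  # Diabetes*glcLevel interaction (Table 2)


def diabetes_odds_ratio(glucose):
    """Odds ratio for diabetes versus no diabetes at a fixed glucose level."""
    return math.exp(BETA_DIABETES + BETA_DIABETES_GLC * glucose)


if __name__ == "__main__":
    for glucose in (100, 125, 150):
        print(f"glcLevel = {glucose}: OR = {diabetes_odds_ratio(glucose):.3f}")
```

Because the interaction coefficient is positive, this odds ratio rises with glucose level, which is
the pattern described for Figure 4.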


4. CONCLUSION 

All analyses were performed using SAS 9.2 software. From our analysis, we found that the coefficients
of BMI, Diabetes, glcLevel, and Age*totChol are negative, which suggests that the odds of a 10-year
risk of CHD decrease as the values of those terms increase. On the other hand, the coefficients of Sex,
Age, CurrentSmoker, CigsPerDay, PrevalentHyp, totChol, SysBP, totChol*glcLevel, Diabetes*glcLevel, and
BMI*totChol are positive, so the odds of a 10-year risk of CHD increase as the values of those terms
increase. The larger the magnitude of a coefficient, the more the odds of CHD increase or decrease,
according to whether its sign is positive or negative. In conclusion, we found that Sex, Age,
CigsPerDay, and SysBP are important risk factors in predicting the 10-year risk of CHD for the
Framingham study population, and that the effect of Age is modified by total cholesterol level.
The effect of totChol is in turn modified by glucose concentration level and BMI. Increased Age and
CigsPerDay increase the odds of a 10-year risk of CHD, but the notable finding is that patients with
Diabetes have higher odds of a 10-year risk of CHD at higher levels of glcLevel than at low levels of
glucose concentration. A limitation of this study is that the variable glcLevel had the highest number
of missing values. By addressing this limitation and including other important covariates, a better
prediction model could be developed to identify the risk of CHD for the Framingham study population.